Analyzing Hacker News Dataset

data-analysis
hacker-news
Author: Victor Omondi

Published: August 30, 2020

import pandas as pd
import re

Read the Dataset

hn = pd.read_csv('../hacker_news.csv')
hn.head()
id title url num_points num_comments author created_at
0 12224879 Interactive Dynamic Video http://www.interactivedynamicvideo.com/ 386 52 ne0phyte 8/4/2016 11:52
1 11964716 Florida DJs May Face Felony for April Fools' W... http://www.thewire.com/entertainment/2013/04/f... 2 1 vezycash 6/23/2016 22:20
2 11919867 Technology ventures: From Idea to Enterprise https://www.amazon.com/Technology-Ventures-Ent... 3 1 hswarna 6/17/2016 0:01
3 10301696 Note by Note: The Making of Steinway L1037 (2007) http://www.nytimes.com/2007/11/07/movies/07ste... 8 2 walterbell 9/30/2015 4:12
4 10482257 Title II kills investment? Comcast and other I... http://arstechnica.com/business/2015/10/comcas... 53 22 Deinos 10/31/2015 9:48

Dataset Shape

hn.shape
(20099, 7)
hn.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20099 entries, 0 to 20098
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            20099 non-null  int64 
 1   title         20099 non-null  object
 2   url           17659 non-null  object
 3   num_points    20099 non-null  int64 
 4   num_comments  20099 non-null  int64 
 5   author        20099 non-null  object
 6   created_at    20099 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB
hn.isnull().sum()
id                 0
title              0
url             2440
num_points         0
num_comments       0
author             0
created_at         0
dtype: int64
hn.describe().T
                count          mean            std         min         25%         50%         75%         max
id            20099.0  1.131755e+07  696453.087424  10176908.0  10701720.0  11284523.0  11926127.0  12578975.0
num_points    20099.0  5.029663e+01     107.110322         1.0         3.0         9.0        54.0      2553.0
num_comments  20099.0  2.480303e+01      56.108639         1.0         1.0         3.0        21.0      1733.0

How many times is Python mentioned in the titles of stories in our Hacker News dataset?

len([title for title in hn.title.to_list() if re.search('[Pp]ython', title)])
160
hn.title.str.contains('[Pp]ython').sum()
160
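Both counts agree because Series.str.contains runs a regular expression search on each title, just like the list comprehension does with re.search. A minimal sketch on a few made-up titles (the sample Series is hypothetical and not taken from the dataset):

```python
import re
import pandas as pd

# Hypothetical titles, for illustration only
sample = pd.Series([
    "Python 3.6 released",
    "Why I like python",
    "Monty PYTHON retrospective",
    "Ruby on Rails tips",
])

# Pure-Python count: re.search on every title
loop_count = len([t for t in sample if re.search(r"[Pp]ython", t)])

# Vectorized count: the boolean mask sums to the number of matches
vector_count = int(sample.str.contains(r"[Pp]ython").sum())

print(loop_count, vector_count)  # 2 2
```

The character class [Pp] only covers the first letter, so the all-caps "PYTHON" title is not counted by either approach.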

Titles that mention the programming language Ruby

hn.title[hn.title.str.contains('[Rr]uby')]
190                     Ruby on Google AppEngine Goes Beta
484           Related: Pure Ruby Relational Algebra Engine
1388     Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949     Rewriting a Ruby C Extension in Rust: How a Na...
2022     Show HN: CrashBreak  Reproduce exceptions as f...
2163                   Ruby 2.3 Is Only 4% Faster than 2.2
2306     Websocket Shootout: Clojure, C++, Elixir, Go, ...
2620                       Why Startups Use Ruby on Rails?
2645     Ask HN: Should I continue working a Ruby gem f...
3290     Ruby on Rails and the importance of being stup...
3749     Telegram.org Bot Platform Webhooks Server, for...
3874     Warp Directory (wd) unix command line tool for...
4026     OS X 10.11 Ruby / Rails users can install ther...
4163     Charles Nutter of JRuby Banned by Rubinius for...
4602     Quiz: Ruby or Rails? Matz and DHH were not abl...
5832     Show HN: An experimental Python to C#/Go/Ruby/...
6180     Shrine  A new solution for handling file uploa...
7171     JRuby+Truffle: Why its important to optimise t...
7235                                        Ruby or Rails?
7671                    How I hunted the most odd ruby bug
7776     Elixir obsoletes Ruby, Erlang and Clojure in o...
7870                            Elixir and Ruby Comparison
8502     Show HN: Di-ary  a math note-taking app built ...
10212               Ruby has been fast enough for 13 years
11060    Show HN: VeryAnts: Probabilistic Integer Arith...
11534                             The Ruby Code of Conduct
11622    FasterPath: Faster Pathname Handling for Ruby ...
12061       Ask HN: What's your favorite ruby HTTP client?
12091    Show HN: Automated Bundle Update with Descript...
12114                                         Awesome Ruby
12543    Ruby Bug: SecureRandom should try /dev/urandom...
12987    Show HN: Klipse  code evaluator pluggable on a...
13550    Matz: I cannot accept the CoC for the Ruby com...
13650                  Programs that rewrite Ruby programs
14798                  Ruby Wrapper for Telegram's Bot API
14980                    A Ruby gem for genetic algorithms
16093                          Master Ruby Web APIs Is Out
16149         Ruru: native Ruby extensions written in Rust
16327                   Make Ruby Great Again [transcript]
16422                                 Object Oriented Ruby
16536                           Ruby Deoptimization Engine
16875                         Video: Make Ruby Great Again
17072    A coupon/deals site built using Roda gem for Ruby
17510                        Table Flip on Ruby Exceptions
18877    Using Rust with Ruby, a Deep Dive with Yehuda ...
19077                           Python is Better than Ruby
19224                    Modern concurrency tools for Ruby
19743    Using a Neural Network to Train a Ruby Twitter...
Name: title, dtype: object

How many titles in our dataset mention email or e-mail?

hn.title[hn.title.str.contains('e-?mail')]
119      Show HN: Send an email from your shell to your...
313          Disposable emails for safe spam free shopping
1361     Ask HN: Doing cold emails? helps us prove this...
1750     Protect yourself from spam, bots and phishing ...
2421                    Ashley Madison hack treating email
                               ...                        
18098    House panel looking into Reddit post about Cli...
18583    Mailgen  Generates clean, responsive HTML for ...
18847    Show HN: Crisp iOS keyboard for email and text...
19303    Ask HN: Why big email providers don't sign the...
19446    Tell HN: Secure email provider Riseup will run...
Name: title, Length: 86, dtype: object
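The ? quantifier makes the preceding hyphen optional, so the pattern accepts "email" and "e-mail" but nothing else. A small sketch on made-up titles (sample is hypothetical) shows which spellings this case-sensitive pattern picks up:

```python
import pandas as pd

# Hypothetical titles, for illustration only
sample = pd.Series([
    "Send an email from your shell",
    "My e-mail is broken",
    "Email Apps Suck",   # capital E: not matched, the pattern is case-sensitive
    "e mail me",         # space instead of hyphen: not matched
])

mask = sample.str.contains(r"e-?mail")
print(mask.tolist())  # [True, True, False, False]
```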

How many titles in our dataset have tags?

hn.title[hn.title.str.contains(r'\[\w+\]')]
66       Analysis of 114 propaganda sources from ISIS, ...
100      Munich Gunman Got Weapon from the Darknet [Ger...
159           File indexing and searching for Plan 9 [pdf]
162      Attack on Kunduz Trauma Centre, Afghanistan  I...
195                 [Beta] Speedtest.net  HTML5 Speed Test
                               ...                        
19763    TSA can now force you to go through body scann...
19867                       Using Pony for Fintech [video]
19947                                Swift Reversing [pdf]
19979    WSJ/Dowjones Announce Unauthorized Access Betw...
20089    Users Really Do Plug in USB Drives They Find [...
Name: title, Length: 444, dtype: object

We were able to calculate that 444 of the 20,099 Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags was, and how many of each are in the dataset? To do this, we’ll need to use capture groups.

# extract all of the tags from the Hacker News titles and build a frequency table of those tags.

hn['title'].str.extract(r'\[(\w+)\]')[0].value_counts().head()
pdf       276
video     111
2015        3
audio       3
slides      2
Name: 0, dtype: int64
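A capture group keeps only the part of the match inside the parentheses, so the square brackets are matched but not returned. A minimal sketch on made-up titles (sample is hypothetical; expand=False is used here so that a single group comes back as a Series rather than a one-column DataFrame):

```python
import pandas as pd

# Hypothetical titles, for illustration only
sample = pd.Series([
    "Swift Reversing [pdf]",
    "Using Pony for Fintech [video]",
    "[Beta] Speedtest.net",
    "No tag here",
    "File indexing for Plan 9 [pdf]",
])

# The brackets are escaped literals; only the \w+ inside (...) is captured
tags = sample.str.extract(r"\[(\w+)\]", expand=False)

# Titles without a tag yield NaN, which value_counts ignores by default
tag_counts = tags.value_counts()
print(tag_counts)
```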
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    titles = hn['title']
    return titles[titles.str.contains(pattern)].head(10)

Titles that contain Java

hn.title[hn.title.str.contains(r'[Jj]ava[^Ss]')]
436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1840                     Adopting RxJava on the Airbnb App
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
2910                 2016 JavaOne Intel Keynote  32mn Talk
3452     What are the Differences Between Java Platform...
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
5947                                        JavaFX is dead
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope...
7481     Ask HN: Beside Java what languages have a stro...
8100        Advantages of Functional Programming in Java 8
8135     Show HN: Rogue AI Dungeon, javacript bot scrip...
8447                  Show HN: Java multicore intelligence
8487     Why IntelliJ IDEA is hailed as the most friend...
8984     Ask HN: Should Learn/switch to JavaScript Prog...
8987     Last-khajiit/vkb: Java bot for vk.com competit...
10529             Angular 2 coming to Java, Python and PHP
11454    Ask HN: Java or .NET for a new big enterprise ...
11902                         The Java Deserialization Bug
12382          Ask HN: Why does Java continue to dominate?
12582    Java Memory Model Examples: Good, Bad and Ugly...
12711    Oracle seeks $9.3B for Googles use of Java in ...
13048        A high performance caching library for Java 8
13105    Show HN: Backblaze-b2 is a simple java library...
13150             Java Tops TIOBE's Popular-Languages List
13170    Show HN: Tablesaw: A Java data-frame for 500M-...
13272      Java StringBuffer and StringBuilder performance
13620    1M Java questions have now been asked on Stack...
13839        Ask HN: Hosting a Java Spring web application
13843                                 Var and val in Java?
13844               Answerz.com  Java and J2ee Programming
13930     Java 8s new Optional type doesn't solve anything
13934    Java 6 vs. Java 7 vs. Java 8 between 2013  201...
15257                       Oracle and the fall of Java EE
15868                 Java generics never cease to impress
16023    Will you use ReactJS with a REST service inste...
16932       Swift versus Java: the bitset performance test
16948          Show HN: Bt  0-hassle BitTorrent for Java 8
17579                Java Lazy Streamed Zip Implementation
18407    Show HN: Scala idioms in Java: cases, patterns...
19481    Show HN: Adding List Comprehension in Java - E...
19735          Java Named Top Programming Language of 2015
Name: title, dtype: object
hn.title[hn.title.str.contains(r'\b[Jj]ava\b')]
436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1023                          Pippo  Web framework in Java
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
3228                               Comparing Rust and Java
3452     What are the Differences Between Java Platform...
3627                     Friends don't let friends do Java
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope...
7481     Ask HN: Beside Java what languages have a stro...
7686             Insider: Oracle has lost interest in Java
8100        Advantages of Functional Programming in Java 8
8447                  Show HN: Java multicore intelligence
8487     Why IntelliJ IDEA is hailed as the most friend...
8984     Ask HN: Should Learn/switch to JavaScript Prog...
8987     Last-khajiit/vkb: Java bot for vk.com competit...
10529             Angular 2 coming to Java, Python and PHP
11454    Ask HN: Java or .NET for a new big enterprise ...
11902                         The Java Deserialization Bug
12382          Ask HN: Why does Java continue to dominate?
12582    Java Memory Model Examples: Good, Bad and Ugly...
12711    Oracle seeks $9.3B for Googles use of Java in ...
12730                              Show HN: Shazam in Java
13048        A high performance caching library for Java 8
13105    Show HN: Backblaze-b2 is a simple java library...
13150             Java Tops TIOBE's Popular-Languages List
13170    Show HN: Tablesaw: A Java data-frame for 500M-...
13272      Java StringBuffer and StringBuilder performance
13620    1M Java questions have now been asked on Stack...
13839        Ask HN: Hosting a Java Spring web application
13843                                 Var and val in Java?
13844               Answerz.com  Java and J2ee Programming
13930     Java 8s new Optional type doesn't solve anything
13934    Java 6 vs. Java 7 vs. Java 8 between 2013  201...
14393              JavaScript is immature compared to Java
14847    Show HN: TurboRLE: Bringing Turbo Run Length E...
15257                       Oracle and the fall of Java EE
15868                 Java generics never cease to impress
16023    Will you use ReactJS with a REST service inste...
16932       Swift versus Java: the bitset performance test
16948          Show HN: Bt  0-hassle BitTorrent for Java 8
17458                            Super Mario clone in Java
17579                Java Lazy Streamed Zip Implementation
18407    Show HN: Scala idioms in Java: cases, patterns...
19481    Show HN: Adding List Comprehension in Java - E...
19735          Java Named Top Programming Language of 2015
Name: title, dtype: object
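The two result sets differ because the negated class [^Ss] must consume one character after "Java", so it misses titles that end in "Java" and still lets through words such as "JavaFX". The word boundary \b consumes nothing and only requires that "Java" stands alone. A minimal sketch on made-up titles (sample is hypothetical):

```python
import pandas as pd

# Hypothetical titles, for illustration only
sample = pd.Series([
    "Java generics never cease to impress",
    "JavaScript fatigue",
    "Shazam in Java",    # "Java" is the last word
    "JavaFX is dead",
])

# Negated class: needs some character other than S/s right after "Java"
negated = sample.str.contains(r"[Jj]ava[^Ss]")

# Word boundaries: "Java" must be a whole word
bounded = sample.str.contains(r"\b[Jj]ava\b")

print(negated.tolist())  # [True, False, False, True]
print(bounded.tolist())  # [True, False, True, False]
```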

How many titles have tags at the start versus the end of the story title in our Hacker News dataset?

hn.title.str.contains(r'^\[\w+\]').sum()
15
hn.title.str.contains(r'\[\w+\]$').sum()
417
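The ^ and $ anchors pin the tag to the start or the end of the title, so a tag in the middle is counted by neither pattern. A minimal sketch on made-up titles (sample is hypothetical):

```python
import pandas as pd

# Hypothetical titles, for illustration only
sample = pd.Series([
    "[pdf] A paper",
    "A paper [pdf]",
    "A [draft] paper",   # tag in the middle: matched by neither anchor
    "No tag at all",
])

starts_with_tag = int(sample.str.contains(r"^\[\w+\]").sum())
ends_with_tag = int(sample.str.contains(r"\[\w+\]$").sum())

print(starts_with_tag, ends_with_tag)  # 1 1
```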

Count the number of times that email is mentioned in story titles.

hn.title.str.contains(r'\be\-?\s?mails?\b', flags=re.I).sum()
141
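This pattern is broader than the earlier one: re.I ignores case, \s? allows a space between "e" and "mail", s? allows the plural, and the \b anchors stop it from matching inside longer words. A minimal sketch on made-up titles (sample is hypothetical):

```python
import re
import pandas as pd

# Hypothetical titles, for illustration only
sample = pd.Series([
    "Send an email from your shell",
    "My E-Mail is broken",
    "e mail me",
    "Emails leaked",
    "Bluemail client",   # "email" sits inside a longer word, so \b rejects it
])

pattern = r"\be\-?\s?mails?\b"
mask = sample.str.contains(pattern, flags=re.I)

print(mask.tolist())  # [True, True, True, True, False]
```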

We’ll continue to analyze and count mentions of different programming languages in the dataset, and then we’ll finish by extracting the different components of the URLs submitted to Hacker News.

Count the number of times that SQL is mentioned in story titles.

hn.title.str.contains(r'sql', flags=re.I).sum()
108
hn_sql = hn[hn.title.str.contains(r'\w+sql', flags=re.I)].copy()
hn_sql['flavor'] = hn_sql['title'].str.extract(r'(\w+sql)', flags=re.I)[0].str.lower()
sql_pivot = hn_sql.pivot_table(index='flavor', values='num_comments')
sql_pivot
            num_comments
flavor                  
cloudsql        5.000000
memsql         14.000000
mysql          12.230769
nosql          14.529412
postgresql     25.962963
sparksql        1.000000
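The steps above filter to titles with a prefixed SQL name, extract and lowercase that name, then average the comment counts per name. A minimal sketch of the same flow on a made-up frame (sample and its numbers are hypothetical; pivot_table uses the mean as its default aggregation):

```python
import re
import pandas as pd

# Hypothetical data, for illustration only
sample = pd.DataFrame({
    "title": ["PostgreSQL 9.5 released", "MySQL tips", "Why NoSQL?",
              "mysql backups", "Plain SQL"],
    "num_comments": [10, 4, 7, 6, 3],
})

# \w+sql needs at least one word character directly before "sql",
# so the bare "SQL" in the last title is excluded
mask = sample["title"].str.contains(r"\w+sql", flags=re.I)
flavors = sample[mask].copy()

# Lowercasing merges "MySQL" and "mysql" into one flavor
flavors["flavor"] = (
    flavors["title"].str.extract(r"(\w+sql)", flags=re.I, expand=False).str.lower()
)

pivot = flavors.pivot_table(index="flavor", values="num_comments")
print(pivot)
```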

Which version of Python is mentioned most often in our dataset?

hn.title.str.extract(r'python ([\d\.]+)', flags=re.I)[0].value_counts().to_dict()
{'3': 10,
 '2': 3,
 '3.5': 3,
 '3.6': 2,
 '2.7': 1,
 '8': 1,
 '1.5': 1,
 '3.5.0': 1,
 '4': 1}

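C programming titles

Note that (?!<series) in the pattern below is a negative lookahead for the literal text <series, not a negative lookbehind, so “Series C” funding titles are not filtered out and still appear in the output. A negative lookbehind would be written (?<!series\s).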

hn.title[hn.title.str.contains(r'(?!<series)\bc\b(?![\.\+])', flags=re.I)]
221                   MemSQL (YC W11) Raises $36M Series C
365                       The new C standards are worth it
444            Moz raises $10m Series C from Foundry Group
521           Fuchsia: Micro kernel written in C by Google
1307             Show HN: Yupp, yet another C preprocessor
                               ...                        
18549            Show HN: An awesome C library for Windows
18649                 Python vs. C/C++ in embedded systems
18689                    Philz Coffee raises $45M Series C
19151                      Ask HN: How to learn C in 2016?
19933    Lightweight C library to parse NMEA 0183 sente...
Name: title, Length: 105, dtype: object
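The "Series C" rows in this output show that the first group in the pattern does not act as a lookbehind. A minimal sketch on made-up titles (sample is hypothetical) comparing the pattern as written with a variant that uses a negative lookbehind, (?<!series\s); the variant is a suggested correction, not something run against the dataset here:

```python
import re
import pandas as pd

# Hypothetical titles, for illustration only
sample = pd.Series([
    "The new C standards are worth it",
    "Moz raises $10m Series C",
    "Python vs. C++ performance",
    "Ask HN: How to learn C in 2016?",
])

# As written: (?!<series) is a lookahead for the literal text "<series"
as_written = r"(?!<series)\bc\b(?![\.\+])"

# Variant: (?<!series\s) is a lookbehind that rejects "Series C"
with_lookbehind = r"(?<!series\s)\bc\b(?![\.\+])"

mask_written = sample.str.contains(as_written, flags=re.I)
mask_fixed = sample.str.contains(with_lookbehind, flags=re.I)

print(mask_written.tolist())  # [True, True, False, True]
print(mask_fixed.tolist())    # [True, False, False, True]
```

Python's re module requires lookbehinds to have a fixed width, which series\s satisfies.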

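Make all the different variations of “email” in the dataset uniform.

# regex=True is required on pandas 2.0+, where str.replace treats the pattern as a literal by default
hn['title'] = hn.title.str.replace(r'e[\-\s]?mail', 'email', flags=re.I, regex=True)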
hn.title[hn.title.str.contains('email')]
119      Show HN: Send an email from your shell to your...
161      Computer Specialist Who Deleted Clinton emails...
174                                        email Apps Suck
261      emails Show Unqualified Clinton Foundation Don...
313          Disposable emails for safe spam free shopping
                               ...                        
19303    Ask HN: Why big email providers don't sign the...
19395    I used HTML email when applying for jobs, here...
19446    Tell HN: Secure email provider Riseup will run...
19838                       Petition to Open Sourcemailbox
19905    Gmail Will Soon Warn Users When emails Arrive ...
Name: title, Length: 151, dtype: object
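Row 19838 in the output ("Open Sourcemailbox") suggests the replacement also fired across a word ending in "e" followed by a word starting with "Mail". A minimal sketch on made-up titles (sample is hypothetical) comparing the pattern used above with a variant that adds a leading \b; the variant is a suggestion and was not run against the dataset:

```python
import re
import pandas as pd

# Hypothetical titles, for illustration only
sample = pd.Series([
    "E-mail Apps Suck",
    "Open Source Mailbox",
    "Send an e mail",
    "Leaked Emails",
])

# Pattern as used above: can match the "e" that ends "Source" plus " Mail"
loose = sample.str.replace(r"e[\-\s]?mail", "email", flags=re.I, regex=True)

# Variant with a leading word boundary: the "e" must start a word
strict = sample.str.replace(r"\be[\-\s]?mail", "email", flags=re.I, regex=True)

print(loose.tolist())
print(strict.tolist())
```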

Extract components of URLs from our dataset.

Most stories on Hacker News contain a link to an external resource. Once we have extracted the domains, we will build a frequency table so we can determine the most popular domains. There are over 7,000 unique domains in our dataset, so to make the frequency table easier to analyze, we’ll focus on the domains at the top of the table.

hn.url.str.extract(r'https?://([\w\-\.]+)', flags=re.I)[0].value_counts()
github.com                1008
medium.com                 825
www.nytimes.com            525
www.theguardian.com        248
techcrunch.com             245
                          ... 
pss-camera.appspot.com       1
www.mrrrgn.com               1
ams-ix.net                   1
www.codeshare.co.uk          1
lambdaschool.com             1
Name: 0, Length: 7251, dtype: int64
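To restrict the frequency table to the most popular entries, value_counts can be followed by head. A minimal sketch on made-up URLs (sample_urls and the cut-off of 2 are illustrative; on the full dataset, head(20) would give the top 20 domains):

```python
import re
import pandas as pd

# Hypothetical URLs, for illustration only
sample_urls = pd.Series([
    "https://github.com/a/b",
    "http://github.com/c",
    "https://medium.com/x",
    "http://www.nytimes.com/y",
    "https://medium.com/z",
    "https://github.com/q",
])

# Capture everything after the scheme up to the first character
# that is not a word character, hyphen, or dot
domains = sample_urls.str.extract(r"https?://([\w\-\.]+)", flags=re.I, expand=False)

# value_counts sorts by frequency, so head(n) keeps the n most common domains
top_domains = domains.value_counts().head(2)
print(top_domains)
```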
# Close the protocol group before '://' so that it captures only the scheme
hn_urls = hn.url.str.extract(r'(?P<protocol>\w+)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)', flags=re.I)
hn_urls.head()
   protocol  domain                           path
0  http      www.interactivedynamicvideo.com
1  http      www.thewire.com                  entertainment/2013/04/florida-djs-april-fools-...
2  https     www.amazon.com                   Technology-Ventures-Enterprise-Thomas-Byers/dp...
3  http      www.nytimes.com                  2007/11/07/movies/07stein.html?_r=0
4  http      arstechnica.com                  business/2015/10/comcast-and-other-isps-boost-...
hn_urls.domain.value_counts()
github.com                1008
medium.com                 825
www.nytimes.com            525
www.theguardian.com        248
techcrunch.com             245
                          ... 
pss-camera.appspot.com       1
www.mrrrgn.com               1
ams-ix.net                   1
www.codeshare.co.uk          1
lambdaschool.com             1
Name: domain, Length: 7251, dtype: int64
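With named groups, str.extract uses the group names as column names, and rows without a URL come back as NaN in every column. A minimal sketch in which protocol captures only the scheme (sample_urls is hypothetical; the second URL is made up for illustration):

```python
import pandas as pd

# Hypothetical URLs, for illustration only; None stands in for a story without a link
sample_urls = pd.Series([
    "http://www.interactivedynamicvideo.com/",
    "https://github.com/pandas-dev/pandas",
    None,
])

# Each named group is closed before the next one starts
pattern = r"(?P<protocol>\w+)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)"
parts = sample_urls.str.extract(pattern)

print(parts)
```

A URL that ends right after the domain, as in the first row, produces an empty string in the path column rather than NaN.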